
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models#24982

Merged
tlrmchlsmth merged 39 commits intovllm-project:mainfrom
tlrmchlsmth:tp_attn_fix_more_models
Sep 27, 2025

Conversation

@tlrmchlsmth
Member

@tlrmchlsmth tlrmchlsmth commented Sep 16, 2025

Purpose

Prior to this PR, using TP attention together with EP MoEs (`--tensor-parallel-size N --data-parallel-size M --enable-expert-parallel`) would in many cases result in a factor of N redundant work in the MoE layers.

This PR extends #24134 to other models, and to the `naive` and `allgather_reducescatter` All2All backends.
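The shape of the redundancy (and of the fix) can be sketched in plain Python. This is a simplified, framework-free illustration under the assumption that after TP attention each TP rank holds a full replica of the token sequence; `shard_for_moe` and `allgather` are invented names for this sketch, not vLLM APIs:

```python
def shard_for_moe(tokens: list, tp_rank: int, tp_size: int) -> list:
    """Each TP rank replicates the tokens after TP attention.  Without
    sequence sharding, every rank would dispatch all tokens to the EP MoE,
    doing tp_size times the necessary work.  Sharding gives each rank a
    disjoint contiguous chunk (sequence-parallel MoE dispatch)."""
    chunk = (len(tokens) + tp_size - 1) // tp_size  # ceil division
    return tokens[tp_rank * chunk:(tp_rank + 1) * chunk]


def allgather(shards: list) -> list:
    """Stand-in for the all-gather that reassembles the full sequence
    after the MoE layer."""
    return [tok for shard in shards for tok in shard]


tokens = list(range(10))
shards = [shard_for_moe(tokens, r, 4) for r in range(4)]
# Each token is dispatched exactly once across the 4 TP ranks:
assert allgather(shards) == tokens
```

The point of the sketch is only the bookkeeping: each rank works on a disjoint slice instead of the full replica, so total MoE work is 1x rather than Nx.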

Test Plan

```shell
vllm serve {{MODEL}} -tp 2 -dp 2 --enable-expert-parallel --port 8192
```

```shell
lm_eval --model local-completions --tasks gsm8k \
  --model_args model={{MODEL}},base_url={{BASE_URL}}/v1/completions,num_concurrent=50,max_retries=3,tokenized_requests=False \
  --limit 100
```

Test Result

Qwen/Qwen3-30B-A3B-FP8:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.88|±  |0.0327|
|     |       |strict-match    |     5|exact_match|↑  | 0.94|±  |0.0239|

Qwen/Qwen3-Next-80B-A3B-Instruct (with --enforce-eager due to #25437):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.80|±  |0.0402|
|     |       |strict-match    |     5|exact_match|↑  | 0.74|±  |0.0441|

meta-llama/Llama-4-Scout-17B-16E:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.82|±  |0.0386|
|     |       |strict-match    |     5|exact_match|↑  | 0.82|±  |0.0386|

ibm-granite/granite-4.0-tiny-preview (with --enforce-eager due to #25437 (comment)):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.58|±  |0.0496|
|     |       |strict-match    |     5|exact_match|↑  | 0.55|±  |0.0500|

openai/gpt-oss-20b (main at TP4 is almost the same):

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3685|±  |0.0133|
|     |       |strict-match    |     5|exact_match|↑  |0.2365|±  |0.0117|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
@mergify mergify bot added deepseek Related to DeepSeek models qwen Related to Qwen models labels Sep 16, 2025
@mergify
Contributor

mergify bot commented Sep 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 17, 2025
@tlrmchlsmth tlrmchlsmth added this to the v0.11.0 milestone Sep 18, 2025
@mergify mergify bot added llama Related to Llama models speculative-decoding labels Sep 21, 2025
@mergify mergify bot removed the needs-rebase label Sep 21, 2025
Runs but wrong answer in this case

xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Sep 30, 2025
After vllm-project/vllm#24982 merged, sequence-parallel MoE is turned on when `enable_expert_parallel=True`, `tp_size > 1`, and `dp_size > 1`. Since Gaudi has no alternative `VLLM_ALL2ALL_BACKEND` to select, we cannot easily bypass it, so this PR adds support for the feature.

```python
class ParallelConfig:

    @property
    def use_sequence_parallel_moe(self) -> bool:
        return (envs.VLLM_ALL2ALL_BACKEND
                in ("allgather_reducescatter", "naive",
                    "deepep_high_throughput", "deepep_low_latency")
                and self.enable_expert_parallel
                and self.tensor_parallel_size > 1
                and self.data_parallel_size > 1)
```
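As a quick sanity check of the gating condition, it can be exercised as a plain function. This is a standalone sketch, not the actual vLLM `ParallelConfig`; the function name and parameters are invented for illustration:

```python
def use_sequence_parallel_moe(backend: str, enable_ep: bool,
                              tp_size: int, dp_size: int) -> bool:
    # Mirrors the property's logic as a free function for illustration.
    return (backend in ("allgather_reducescatter", "naive",
                        "deepep_high_throughput", "deepep_low_latency")
            and enable_ep and tp_size > 1 and dp_size > 1)


assert use_sequence_parallel_moe("naive", True, 2, 2)
assert not use_sequence_parallel_moe("naive", True, 1, 2)   # needs tp_size > 1
assert not use_sequence_parallel_moe("naive", False, 2, 2)  # needs EP enabled
```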

Update:
No hard requirement on vllm-project/vllm#25828

---------

Signed-off-by: Wuxun Zhang <wuxun.zhang@intel.com>
iboiko-habana pushed a commit to iboiko-habana/vllm-gaudi that referenced this pull request Oct 2, 2025
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Oct 14, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

ci/build, deepseek (Related to DeepSeek models), gpt-oss (Related to GPT-OSS models), llama (Related to Llama models), multi-modality (Related to multi-modality, #4194), qwen (Related to Qwen models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding

Projects

Status: Done

4 participants